Pig[1] is a high-level platform for creating MapReduce programs that run on Hadoop. The language for this platform is called Pig Latin.[1] Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high-level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs), which the user can write in Java and then call directly from the language.
Pig was originally[2] developed at Yahoo! Research around 2006 to give researchers an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007,[3] it was moved into the Apache Software Foundation.[4]
Below is an example of a "Word Count" program in Pig Latin:
A = load '/tmp/my-copy-of-all-pages-on-internet';          -- load one line per record
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;  -- split each line into words
C = filter B by word matches '\\w+';                       -- keep only word characters
D = group C by word;                                       -- group occurrences of each word
E = foreach D generate COUNT(C) as count, group as word;   -- count each group
F = order E by count desc;                                 -- sort by frequency, highest first
store F into '/tmp/number-of-words-on-internet';           -- write the results
The above program generates parallel executable tasks that can be distributed across thousands of machines in a Hadoop cluster to count the words in a data set such as all the web pages on the internet.
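As a rough, non-distributed illustration of what the Pig script computes, the same dataflow (tokenize, filter, group, count, order) can be sketched in plain Python; the function name and sample input here are illustrative, not part of Pig:

```python
import re
from collections import Counter

def word_count(lines):
    """Mimic the Pig dataflow on an in-memory list of lines:
    TOKENIZE + flatten, filter by '\\w+', group by word,
    COUNT each group, then order by count descending."""
    words = []
    for line in lines:                        # A: load
        for token in line.split():            # B: TOKENIZE and flatten
            if re.fullmatch(r"\w+", token):   # C: filter (Pig's `matches` is anchored)
                words.append(token)
    counts = Counter(words)                   # D + E: group and count
    # F: order by count, highest first
    return sorted(counts.items(), key=lambda kv: -kv[1])

print(word_count(["the quick fox", "the fox"]))
```

Pig's `matches` operator tests the whole string against the regular expression, which is why `re.fullmatch` (rather than `re.search`) is the closest Python analogue. In the real Pig program, each of these steps becomes one or more MapReduce stages executed in parallel across the cluster.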